Getting started with read QC

…BRIEF INTRO IN PROGRESS…


Snakemake workflow for read QC

This section describes a tentative Snakemake workflow that defines the read quality control rules as a DAG (directed acyclic graph). A detailed interactive Snakemake report is available here; it is best viewed on a wide screen.

Some potential QC tools

  • SeqKit
  • FastQC
  • MultiQC
  • BBDuk script (bbduk.sh)
  • Trimmomatic
  • KneadData

Some QC resources

  • Adapter fasta files
  • PhiX fasta file

Tool dictionary (environment.yml)

name: readqc
channels:
    - conda-forge
    - bioconda
    - biobakery
dependencies:
    - seqkit=2.3.1
    - fastqc=0.12.1
    - multiqc=1.14
    - bbmap=39.01
    - trimmomatic=0.39
    - kneaddata=0.12.0

Create the environment from this file:

conda activate base
mamba env create -n readqc --file environment.yml
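Before moving on, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of the original workflow: `check_tools` is a hypothetical helper name, and the executable names in the example comment are the ones the listed packages are expected to provide.

```shell
#!/bin/bash
# Report which of the given commands are available on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "${tool}" >/dev/null 2>&1; then
            echo "OK: ${tool}"
        else
            echo "MISSING: ${tool}"
            missing=1
        fi
    done
    return "${missing}"
}

# Example (run inside the activated readqc environment):
# check_tools seqkit fastqc multiqc bbduk.sh trimmomatic kneaddata
```

Any line starting with MISSING points to a package that did not install as expected.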





Read simple statistics

Assuming that the seqkit installation was successful, we can use it to compute simple statistics for the reads. Later, we will use the seqkit output to prepare sample mapping files automatically.

  • If the files are uncompressed, we can save space by compressing them.
  • Let’s navigate to the folder containing the fastq files and compress them with gzip:
gzip *.fastq

From this point forward, we will assume that all the fastq files are in fastq.gz format.
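If the reads live in a different folder, or some files are already compressed, a slightly more defensive variant may be useful. This is only a sketch: `compress_fastq` is a hypothetical helper name, and the integrity check with `gzip -t` is an extra step that the text above does not require.

```shell
#!/bin/bash
# Compress every uncompressed FASTQ file in a directory, then verify
# that each resulting archive is intact.
compress_fastq() {
    local dir="$1"
    local f
    for f in "${dir}"/*.fastq; do
        # Skip the literal pattern when no file matches.
        [ -e "${f}" ] || continue
        gzip "${f}"
        gzip -t "${f}.gz" || { echo "ERROR: corrupt archive ${f}.gz" >&2; return 1; }
    done
}

# Example:
# compress_fastq resources/reads
```

Files that already end in .fastq.gz are not matched by the pattern, so running the function twice is harmless.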

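#!/bin/bash

echo "PROGRESS: Getting stats of the raw reads."

# Input reads and output folder for the seqkit report.
INPUTDIR="resources/reads"
SEQKIT="results/qc/seqkit1"
mkdir -p "${SEQKIT}"

# One row of summary statistics per fastq.gz file.
seqkit stat "${INPUTDIR}"/*.fastq.gz >"${SEQKIT}/seqkit_stats.txt"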
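As a preview of how the seqkit output could feed a sample mapping file, here is a rough sketch. It assumes the default seqkit stat table (file path in the first column, num_seqs in the fourth, possibly with thousands separators) and a hypothetical naming scheme of `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. The function name and the output columns are illustrative and may not match the layout of the files under config/.

```shell
#!/bin/bash
# Turn `seqkit stat` output into a simple tab-separated sample sheet.
# Assumed input: whitespace-separated table with a header row, file path
# in column 1 and num_seqs in column 4.
# Assumed naming: <sample>_1.fastq.gz / <sample>_2.fastq.gz (hypothetical).
make_sample_sheet() {
    local stats_file="$1"
    awk 'BEGIN { OFS = "\t"; print "sample_id", "file", "num_seqs" }
         NR > 1 && NF >= 4 {
             sample = $1
             sub(/^.*\//, "", sample)         # drop the directory
             sub(/\.fastq\.gz$/, "", sample)  # drop the extension
             sub(/_[12]$/, "", sample)        # drop the mate suffix
             reads = $4
             gsub(/,/, "", reads)             # remove thousands separators
             print sample, $1, reads
         }' "${stats_file}"
}

# Example:
# make_sample_sheet results/qc/seqkit1/seqkit_stats.txt > samples_draft.tsv
```

File paths containing spaces would break the column split, so this sketch only suits simple file names.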


Read Quality Control

  • Assuming that most QC tools are ready, it is time to use them to do the following:
    • Check the quality of the reads using fastqc.
    • Create a summary report of quality metrics using multiqc.
    • Trim poor-quality reads at a user-specified cutoff using bbduk.sh.
    • Remove contaminants using bbduk.sh.
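
The four steps above could be chained for one paired-end sample roughly as follows. This is a sketch under stated assumptions, not the workflow's actual rules: the function names, the output layout, the default cutoff of 20, the `minlen=50` filter, and the PhiX path `resources/phix.fa` are all hypothetical placeholders. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/bash
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# qc_sample R1 R2 OUTDIR [TRIMQ] [CONTAMINANT_FASTA]
qc_sample() {
    local r1="$1"
    local r2="$2"
    local outdir="$3"
    local trimq="${4:-20}"                # hypothetical default cutoff
    local phix="${5:-resources/phix.fa}"  # hypothetical PhiX path

    run mkdir -p "${outdir}/fastqc" "${outdir}/trimmed" "${outdir}/decontam"

    # 1. Per-file quality reports.
    run fastqc --outdir "${outdir}/fastqc" "${r1}" "${r2}"

    # 2. One summary report over all FastQC outputs.
    run multiqc --force --outdir "${outdir}/multiqc" "${outdir}/fastqc"

    # 3. Quality-trim both read ends at the chosen Phred cutoff.
    run bbduk.sh in="${r1}" in2="${r2}" \
        out="${outdir}/trimmed/R1.fastq.gz" out2="${outdir}/trimmed/R2.fastq.gz" \
        qtrim=rl trimq="${trimq}" minlen=50

    # 4. Remove reads that share k-mers with the contaminant reference.
    run bbduk.sh in="${outdir}/trimmed/R1.fastq.gz" in2="${outdir}/trimmed/R2.fastq.gz" \
        out="${outdir}/decontam/R1.fastq.gz" out2="${outdir}/decontam/R2.fastq.gz" \
        ref="${phix}" k=31 hdist=1
}

# Example (prints the commands only):
# DRY_RUN=1; qc_sample reads/S1_1.fastq.gz reads/S1_2.fastq.gz results/qc
```

In practice FastQC and MultiQC would typically be rerun after steps 3 and 4, which matches the three QC stages listed in the sections that follow.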


QC on raw reads


QC after trimming poor reads


QC after removing contaminated reads


Processed read status





References

[1]
Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4



Appendix

Project main tree

.
├── LICENSE
├── README.md
├── Rplots.pdf
├── config
│   ├── config.yml
│   ├── pbs
│   ├── pe_samples.tsv
│   ├── pe_units.tsv
│   ├── se_samples.tsv
│   ├── se_units.tsv
│   └── slurm
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── images
│   ├── funnels.png
│   ├── project_tree.txt
│   ├── qc_hist.png
│   ├── qc_hist.svg
│   ├── samples_hist.png
│   ├── samples_hist.svg
│   └── smkreport
├── imap-read-quality-control.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── reporrt.html
├── report.html
├── resources
│   ├── metadata
│   └── reads
├── results
│   ├── project_tree.txt
│   └── qc
├── smk.css
├── styles.css
├── test.Rmd
└── workflow
    ├── Snakefile
    ├── envs
    ├── report
    ├── rules
    ├── schemas
    └── scripts

18 directories, 28 files



Troubleshooting and FAQs

  1. Question
    • Answer
  2. Question
    • Answer